The objective of this analysis is to identify potential data quality issues with in coming Kenya soil health round 1 study data so that the enumeration team can address issues in the field and improve overall quality. This file will also serve as (1) an initial cleaning file and (2) an attrition and balance check to inform when to wrap up enumeration.
- We have likely outliers in the numeric data. Follow up with enumerators about these values.
- Many round 1 GPS points are not close to the baseline value. Follow up about plot continuity and GPS quality. (see Kenya round 1 data check)
rm(list = ls())
cat("\014")
## set up some global options
# always set stringsAsFactors = F when loading data
options(stringsAsFactors=FALSE)
# show the code
knitr::opts_chunk$set(echo = TRUE)
# define all knitr tables to be html format
options(knitr.table.format = 'html')
# change code chunk default to not show warnings or messages
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
libs <- c("dplyr", "reshape2", "knitr", "ggplot2", "tibble", "readxl",
"MASS", "gridExtra", "cowplot", "robustbase", "car", "RStata", "foreign",
"tidyr", "readxl", "stringr", "sp", "dismo", "leaflet", "XML", "ggmap", "knitr",
"MASS")
lapply(libs, require, character.only = T, quietly = T, warn.conflicts = F)
[[1]]
[1] TRUE
[[2]]
[1] TRUE
[[3]]
[1] TRUE
[[4]]
[1] TRUE
[[5]]
[1] TRUE
[[6]]
[1] TRUE
[[7]]
[1] TRUE
[[8]]
[1] TRUE
[[9]]
[1] TRUE
[[10]]
[1] TRUE
[[11]]
[1] TRUE
[[12]]
[1] TRUE
[[13]]
[1] TRUE
[[14]]
[1] TRUE
[[15]]
[1] TRUE
[[16]]
[1] TRUE
[[17]]
[1] TRUE
[[18]]
[1] TRUE
[[19]]
[1] TRUE
[[20]]
[1] TRUE
[[21]]
[1] TRUE
[[22]]
[1] TRUE
[[23]]
[1] TRUE
#### define helpful functions
# define function to adjust table widths
html_table_width <- function(kable_output, width) {
width_html <- paste0(paste0('<col width="', width, '">'), collapse = "\n")
sub("<table>", paste0("<table>\n", width_html), kable_output)
}
source("../oaflib/commcareExport.R")
source("../oaflib/misc.R")
I’m going to use the CommCare API created by Robert On to access the data directly from CommCare. This ensures that there are no changes to the data between the point of access and the point of cleaning and analysis
forceUpdateAll = F
forceUpdate = forceUpdateAll
d <- getFormData("harvest", "Soil Sampling", "Soil Sampling 2017", forceUpdate)
[1] "found 677cf904ae87a358b18d0ad453b747d"
Duplicated column names deduplicated: 'District Name' => 'District Name_1' [20], 'Site Name' => 'Site Name_1' [21], 'Last Name' => 'Last Name_1' [22], 'First Name' => 'First Name_1' [23], 'Client Phone' => 'Client Phone_1' [24], 'What seed type did you use on this plot?' => 'What seed type did you use on this plot?_1' [53], 'How many kgs of seed did you use?' => 'How many kgs of seed did you use?_1' [54], 'Measurement of the reported yield/quantity' => 'Measurement of the reported yield/quantity_1' [56], 'What fertilizer did you use on your ____ for the long rains 2016?' => 'What fertilizer did you use on your ____ for the long rains 2016?_1' [63], 'How many KGs of CAN did you apply to the ____?' => 'How many KGs of CAN did you apply to the ____?_1' [65], 'How many KGs of NPK did you apply to the ____?' => 'How many KGs of NPK did you apply to the ____?_1' [66], 'How many KGs of urea did you apply to the ____?' => 'How many KGs of urea did you apply to the ____?_1' [67]Parsed with column specification:
cols(
.default = col_character(),
metadata.deviceID = col_double(),
metadata.timeStart = col_datetime(format = ""),
metadata.timeEnd = col_datetime(format = ""),
`Are you or household member a member of OAF 2017?` = col_integer(),
`OAF ID` = col_integer(),
`Respondent's approximate age` = col_integer(),
`Respondent's Gender` = col_integer(),
`Number of people in this Household(Include respondent).` = col_integer(),
`How many seasons have you participated with One Acre Fund, including this long rains 2017 season?` = col_integer(),
`How many cows do you own?` = col_integer(),
`How many goats do you own?` = col_integer(),
`How many chickens do you own?` = col_integer(),
`How many pigs do you own?` = col_integer(),
`How many sheep do you own?` = col_integer(),
`Please choose the closest field size` = col_double(),
`How many kgs of seed did you use?` = col_double(),
`What was the yield for this plot?` = col_double(),
`How would you describe your long rains 2016 harvest in this field compared to previous long rains seasons?` = col_integer(),
`How many kgs of seed did you use?_1` = col_double(),
`What was the yield for this intercrop?` = col_double()
# ... with 12 more columns
)
See spec(...) for full column specifications.
# add in force update to get latest report
baselineDir <- normalizePath(file.path("..", "ke_baseline", "data"))
b <- read.csv(file=paste0(baselineDir, "/shs ke baseline.csv")) # obj d
Here I access the soil predictions from the OAF soil lab. Patrick Bell manages the lab and Mike Barber oversees the prediction scripts.
soilDir <- normalizePath(file.path("..", "..", "OAF Soil Lab Folder", "Projects", "ke_shs_17", "4_predicted", "other_summaries"))
soil <- read.csv(file=paste(soilDir, "combined-predictions-including-bad-ones.csv", sep = "/"))
idDir <- normalizePath(file.path("..", "..", "OAF Soil Lab Folder", "Projects", "ke_shs_17", "5_merged"))
Identifiers <- read_excel(paste(idDir,"Soil Sampling - Sample reception 2017 - Sample reception 20172017-04-12 (1).xlsx",sep="/"), sheet=1) %>%
mutate(
reception_barcode = gsub("_", "", reception_barcode)
)
soil$SSN[!soil$SSN %in% Identifiers$reception_barcode]
[1] "oafke002126"
Identifiers$reception_barcode[!Identifiers$reception_barcode %in% soil$SSN]
[1] "oafke0001485" "oafke0002126"
It seems to be the case that one barcode is not matching due to the number of zeros. Correct and update.
soil$SSN[soil$SSN=="oafke002126"] <- "oafke0002126"
There’s one record in the Identifiers that doesn’t appear in the data. An extra sample without a survey? Seems odd.
The variable names are out of control. In order to make the variable names more user friendly, I went to CommCare and downloaded the form contents that includes the variable name. I then wrote in a more user friendly variable name in the file. I import and overwrite the variable names here. Perhaps there’s a better way to do this but it’s easier to log the changes in excel than to write out all the code. The reference files will be on hand in case I accidentally mislabel a variable. The variable names in the excel file start with variable 11 from CommCare. The first 10 CommCare variables are meta data which we will keep.
General TODO
newName <- read_excel("var_names.xlsx", sheet=1)
qTypes <- c("Multiple Choice", "Phone Number or Numeric ID", "Checkbox", "Text", "Decimal", "Image Capture", "Barcode Scan", "GPS", "Integer")
newName <- newName %>% dplyr::select(1:4
) %>% dplyr::filter(newName$Type %in% qTypes) %>% as.data.frame()
newName <- newName %>% filter(new.var.name!="general.comment")
Keep the participation variable in d so we understand who agreed to participate. Make additional changes to d to prepare it to be combined with the meta information from CommCare.
names(d)[26] <- "participation"
names(d)[names(d)=="#form/id_base/Soil_Sample_Id"] <- "Soil_Sample_Id"
names(d)[names(d)=="#form/id_2017new/Soil_Sample_Id"] <- "New_soil_sample_id"
Merge in variable names to check that they match and then replace names(). I’m subtracting two because the Soil_Sample_Id variables appear out of order.
I’m also adding some variables to the new name vector as d now has a different variable length
newNameVec <- c(newName$new.var.name[1], "agree", "interviewed.2016", "refusal", newName$new.var.name[2:(length(newName$new.var.name)-1)], "comment", newName$new.var.name[length(newName$new.var.name)])
names(d)[11:(length(d)-3)] <- newNameVec
# drop vars with drop
d <- d[,-which(grepl("drop.", names(d)))]
Make new variables
#convert acreage to m2 and then calculate fertilizer per acre
d$plot.m2 <- d$plot.size*4046
m2ToAcres <- function(input, meters) {
res <- (input/meters)*4046
return(res)
}
d$can.acre <- m2ToAcres(d$can.main, d$plot.m2)
d$dap.acre <- m2ToAcres(d$dap.main,d$plot.m2)
d$npk.acre <- m2ToAcres(d$npk.main,d$plot.m2)
d$urea.acre <- m2ToAcres(d$urea.main,d$plot.m2)
d$compost.acre <- m2ToAcres(d$compost.kgs,d$plot.m2)
d$seed.acre <- m2ToAcres(d$seed.kgs,d$plot.m2)
d$intercrop.seed.acre <- m2ToAcres(d$intercrop.seed.kgs, d$plot.m2)
d$intercrop.dap.acre <- m2ToAcres(d$dap.intercrop, d$plot.m2)
d$intercrop.can.acre <- m2ToAcres(d$can.intercrop, d$plot.m2)
d$intercrop.npk.acre <- m2ToAcres(d$npk.intercrop, d$plot.m2)
d$intercrop.urea.acre <- m2ToAcres(d$urea.intercrop, d$plot.m2)
Yield calculation - first let’s confirm that when the yield.unit is NA that there isn’t a yield value.
table(d$yield[is.na(d$yield.unit)], useNA = 'ifany')
0 <NA>
24 56
TODO - Charles: I assume a 0 yield is a true value. Or should it be NA? The code below converts all yield values associated with an NA unit to NA when in fact there should be some 0s.
d$yield.kg <- ifelse(d$yield.unit=="100kg", d$yield*100,
ifelse(d$yield.unit=="50kg", d$yield*50,
ifelse(d$yield.unit=="90kg", d$yield*90,
ifelse(d$yield.unit=="GG", d$yield*2, NA))))
Here I take the follow up responses from enumerators and update the values in the the full data. I only want the outlier checks to show values that we haven’t already addressed through soliciting additional feedback from enumerators and farmers. This code assumes the data comes in with rows as farmers and columns as the questions that needed to be addressed. I’ve reshaped it to include
TODO - come back and figure out how to quickly add in the phone data
#updateDir <- paste(getwd(), "enum_summaries", sep = "/")
# updates <- read.csv("Phone calls survey.csv", header=T, stringsAsFactors = F) %>% rename(
# Soil_Sample_Id = soil_sample_id,
# age = Age.of.respondent,
# can.intercrop = How.many.KGs.of.CAN.did.you.apply.to.the..intercrop.,
# can.main = How.many.KGs.of.CAN.did.you.apply.to.the.main.crop.,
# chickens = Number.of.chickens.How.many.chickens.do.you.own..chickens,
# dap.intercrop = How.many.KGs.of.DAP.did.you.apply.to.the.intercrop.,
# dap.main = How.many.KGs.of.DAP.did.you.apply.to.the.main.crop.,
# hhsize = Number.of.people.in.this.Household.Include.respondent..,
# intercrop.seed.kgs = How.many.kgs.of.seed.did.you.use.in.your.intercrop.,
# lime.kgs = How.many.KGs.of.lime.did.you.use.on.this.plot.,
# npk.main = How.many.KGs.of.NPK.did.you.apply.to.the.maincrop.,
# plot.size = What.is.your.aproximate.plotsize,
# seed.kgs = How.many.kgs.of.seed.did.you.use.in.the.Maincrop.,
# sheep = How.many.sheep.do.you.own.,
# urea.main = How.many.KGs.of.urea.did.you.apply.to.the.maincrop.,
# yield = What.was.the.yield.for.this.plot.main.crop.
# ) %>% mutate(
# yield = ifelse(yield=="N/A", NA, ifelse(grepl("Not", yield), NA, yield)),
# yield = ifelse(yield=="32 bags of 90 kgs", "32 bags", yield)
# )
#
# updates$yield.unit <- sapply(updates$yield, function(x){ # separate out the unit
# strsplit(x, " ")[[1]][2]})
#
# updates$yield.kg <- sapply(updates$yield, function(x){ # keep only the number
# strsplit(x, " ")[[1]][1]})
#
# updates$yield.kg <- ifelse(grepl("bag", updates$yield.unit), as.numeric(updates$yield)*90, ifelse(grepl("goro", updates$yield.unit), as.numeric(updates$yield)*2.45, updates$yield))
#
#
# updates <- melt(updates, id.vars = c("Soil_Sample_Id", "Comments"), measure.vars = names(updates)[6:19])
# updates <- updates %>% filter(!is.na(value) | value!= " ")
# check that this works:
#print(updates[1,])
TODO - check into how many merges we’re not getting. Finalize merge report.
d$sampling.barcode <- gsub("_", "", d$sampling.barcode)
names(soil)[names(soil)=="SSN"] <- "sampling.barcode"
d <- left_join(d, soil, by="sampling.barcode")
soilVars <- names(soil)[2:21]
Check all numeric variables for outliers. First, it’s probably safe to replace all -99s with NA
TODO - set up outlier checks
catVars <- names(d)[sapply(d, function(x){
is.character(x)
})]
enumClean <- function(dat, x, toRemove){
dat[,x] <- ifelse(dat[,x] %in% toRemove, NA, dat[,x])
return(dat[,x])
}
strTable <- function(dat, x){
varName = x
tab = as.data.frame(table(dat[,x], useNA = 'ifany'))
tab = tab[order(tab$Freq, decreasing = T),]
end = ifelse(length(tab$Var1)<10, length(tab$Var1), 10)
repOrder = paste(tab$Var1[1:end], collapse=", ")
out = data.frame(variable = varName,
responses = repOrder)
return(out)
}
# clean up known values
catEnumVals <- c("-99", "-88", "- 99", "-99.0", "88", "_88", "- 88", "0.88",
"--88", "__88", "-88.0", "99.0")
d[,catVars] <- sapply(catVars, function(y){
d[,y] <- enumClean(d,y, catEnumVals)
})
responseTable <- do.call(rbind, lapply(catVars, function(x){
strTable(d, x)
}))
A simple table to preview the values in the data. The values are ranked by frequency.
kable(responseTable, format="markdown")
| variable | responses |
|---|---|
| AppFormId | /Users/mlowes/drive/data/commcare_cache/677cf904ae87a358b18d0ad453b747d |
| id | 000ce8ff-56af-4af1-8ab3-dcaaa54fe230, 0011522b-761a-45a4-9783-c217f1e7d4d6, 00135a79-0c27-4b51-89c5-fbd6d5494e06, 0033f926-d5aa-40d5-9279-0bba42d3189b, 00723ce4-7aae-4b17-924c-54885d57a8a0, 008b1760-6518-489a-902b-c6b5d8673cf0, 00905142-2f7a-4231-a32b-0c6f9fb51261, 00b06cd5-846b-4e1b-9ff8-3fd4d7f5a37d, 00b8ff5b-32f8-4909-85e6-a81aaf2c399e, 00bc4021-2420-41c5-838f-2b97646bd24b |
| domain | harvest |
| date_header | NA |
| metadata.userID | 280c99abbb2534baa8bd95c41c7fb47d, 036a641a60414b75155e0e4fefdf01bb, 74acd8230c1733e972169cfc7113229b, d1cb024846d35c496780e0a681cb9f92, d1cb024846d35c496780e0a681cc03c9, 942a528e8cc2dc1f7f060d152839394a, d1cb024846d35c496780e0a681cca808, bc1170b650390a876f8abb9571669788, 2ed6f14f0f5a764c0d911419565ee8ff, 28ccf07370af4faa5f15ecd64bd21c1c |
| metadata.username | eric.k, rachel.g, john.s, stephen.n, lazarus.m, clinton.n, dennis.k, gibson.r, florence.o, wallace.n |
| form.case.@case_id | 43abb59751f94d51acf171d595651fa0, 548a04af36fc4790a272abcfb0c1c7e9, 54ecedd1d1074986a23d2c777d127fd3, 6fb5ac28647742e598e104db43d5b300, 8bd82fadf72542b282443cdb2e1f735b, ef006c5dc2e64d93a7b27c3aefe50009, 000e81e8014a4b8bb569f7616b3f823e, 00329d4fe2304e278c438a73757cbc7c, 0042a665dafd481d9d4f6aac8cd19b2d, 005b4ac31e5e422387906b0910c15eba |
| same.hh | Yes, No, NA |
| agree | NA, yes, no |
| interviewed.2016 | NA, No, Yes |
| refusal | NA, refused |
| region | Western Province, Nyanza, NA, Nyanza1, Western Provinc, Western Province's, Western ProvinceE, Western Provincer |
| district | Gem, Butere, Rachuonyo, Kakamega (South), Lugari, Migori, Kakamega B (North), Rongo, Kisii, Busia |
| site | NA, Kamroth Lower, Kokwanyo West, Sigomere, lureko, Achuth, Kajulu, emangale, Kanyasrega, Kisatiru |
| phone | NA, 0, 0717356420, 0701313242, 0703490814, 0705337016, 0719642096, 0726211155, 0792051319, 712115910 |
| district1 | NA, Rongo, Rachuonyo, Kakamega B (North), Lugari, Suneka, Chwele, Gem, Matete, Busia |
| site1 | NA, Kamuga, Nyang'inja, riana, Bunjosi, Burinda, Chekalini, Gongo, IKURUMA, Isongo |
| participation | consents, NA |
| which.seasons | NA, lr2016, lr2016 lr2015, lr2016 lr2015 lr2014, lr2016 lr2015 lr2014 lr2013, lr2016 lr2015 lr2014 lr2013 lr2012, lr2016 lr2015 lr2014 lr2013 lr2012 lr2011, lr2016 lr2015 lr2014 lr2013 lr2012 lr2011 lr2010, lr2015, lr2016 lr2014 |
| plot.location | The plot is behind the farmers house, The plot is opposite the homestead of the farmer, Besides the farmer's house, behind farmers house, The plot is behind the farmer's homestead, The plot is besides the farmers house, The plot is immediately behind the farmer's house., The plot is infront of the farmers house, The plot is just behind the farmers house, behind farmers house. |
| own | Yes, No, NA |
| relation.land | NA, renting, borrowing, leasing |
| own.2016 | Yes, No, NA |
| most.farming | Female, both, Male, NA |
| main.crop | maize, NA, fallow, other, sorghum, sugar, beans, sukuma, groundnuts |
| main.crop.specify | NA, Nappier grass, tobacco, cassava, cassavas, Napier grass, nappier grass, Pineapple, Tobacco |
| seed.type | hybrid, local, NA |
| yield.unit | 90kg, GG, 100kg, 50kg, NA |
| intercrop | beans, 0, NA, groundnuts, soya, maize, sorghum, sugar, other, sweetpotatoes |
| intercrop.specify | NA, cassava |
| intercrop.seed.type | local, NA, hybrid |
| intercrop.yield.unit | GG, NA, 90kg, 100kg, 50kg |
| fertilizer.main | dap can, dap, none, NA, dap urea, dap can urea, can, can npk, can urea, other |
| fertilizer.intercrop | none, NA, dap, dap can, can, other, dap none, urea, npk, dap other |
| compost.quality | NA, good, average, poor |
| compost.type | NA, cow, plant_residue, goat, chicken, kitchen_waste, pig |
| lime | no, NA, yes_oaf, yes_outside |
| fertility.comp | average, morefertile, lessfertile, NA |
| field.location | hillside, valley, hilltop, NA |
| soil.type | loam, sandy, clay, NA |
| soil.color | brown, black, red, NA |
| erosion | 0, napier, grad_terrace, drainage, radical_terrace, NA |
| fallow.yn | No, Yes, NA, dont_know |
| fallow.cover | NA, weeds, grasses, legumes, mixed_grasses&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;legumes, Bare_fallow, mixed_grasses&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;legumes, mixed_grasses&amp;amp;amp;amp;amp;amp;amp;amp;legumes |
| residues | planted_into_residues, fed_to_livestock, no_incorporation, incorporated_by_hand, sold_residue, NA |
| photo | NA, 1325379545882.jpg, 1325379607350.jpg, 1325380848958.jpg, 1486093019962.jpg, 1486096823433.jpg, 1486097641286.jpg, 1486176674776.jpg, 1486177274444.jpg, 1486182820463.jpg |
| sampling.barcode | oafke0000321, oafke0000325, oafke0000661, oafke0000940, oafke0000966, oafke0001085, oafke0002000, oafke0002265, oafke0000001, oafke0000002 |
| comment | NA, No comment, N/A, no comment, none, No comments, non, no comments, 0, no |
| gps | -0.5056573852841059 34.73783778147609 1390.63 15.0, -0.6098935 34.6105897 0.0 3669.0, -0.5056573852841059 34.73783778147609 1390.63 10.0, -0.5497645 34.6277074 0.0 4821.0, -0.5843478 34.6279001 0.0 4799.0, -0.7390152 34.5656951 0.0 4553.0, -0.5504687967967642 34.73721033097699 1509.32 15.0, -0.5906037 34.7379935 0.0 3000.0, -0.598296 34.7092347 0.0 4807.0, -0.6102046 34.5573404 0.0 3965.0 |
| #form/Sampling2017_Complete | Complete, NA |
| Soil_Sample_Id | NA, Butere61451, c1019, Kisii43379, Busia11633, Busia11638, Busia11913, Busia12588, Busia12982, Busia13182 |
| New_soil_sample_id | NA, Busia25200, Butere15379, c1021, c1106, c1126, c1138, c1140, c1149, c1176 |
repGraphs <- function(dat, x){
tab = as.data.frame(table(dat[,x], useNA = 'ifany'))
tab = tab[order(tab$Freq, decreasing = T),]
print(
ggplot(data=tab, aes(x=Var1, y=Freq)) + geom_bar(stat="identity") +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title =paste0("Composition of variable: ", x))
)
}
adminVars <- c(names(d)[grep("meta", names(d))],"enum_name", "photo", "participation", "refusal", "phone", "comment", "gps", "Soil_Sample_Id", "sampling.barcode", "id", "domain", "date_header", "form.case.@case_id", "site", "district1", "site1", "plot.location", "New_soil_sample_id", "#form/Sampling2017_Complete", "AppFormId")
nonAdminVars <- catVars[!catVars %in% adminVars]
for(i in 1:length(nonAdminVars)){
repGraphs(d, nonAdminVars[i])
}
numVars <- names(d)[sapply(d, function(x){
is.numeric(x)
})]
Basic cleaning of known issues like enumerator codes for DK, NWR, etc.
enumVals <- c(-88,-85, -99)
d[,numVars] <- sapply(numVars, function(y){
d[,y] <- enumClean(d,y, enumVals)
})
iqr.check <- function(dat, x) {
q1 = summary(dat[,x])[[2]]
q3 = summary(dat[,x])[[5]]
iqr = q3-q1
mark = ifelse(dat[,x] < (q1 - (1.5*iqr)) | dat[,x] > (q3 + (1.5*iqr)), 1,0)
tab = rbind(
summary(dat[,x]),
summary(dat[mark==0, x])
)
return(tab)
}
# remove admin vars
numAdminVars <- c("metadata.deviceID", "oafid", "X")
numVarsNotAdmin <- numVars[!numVars %in% numAdminVars]
iqrTab <- do.call(plyr::rbind.fill, lapply(numVarsNotAdmin, function(y){
#print(y)
res = iqr.check(d, y)
#print(dim(res))
out = data.frame(var=rbind(y, paste(y, ".iqr", sep="")), res)
return(out)
}))
iqrTab[,2:8] <- sapply(iqrTab[,2:8], function(x){round(x,1)})
The outlier table summarizes the numeric variables with and without IQR outliers to show how the data changes based on this filter.
TODO - check the composition of the fertilizer/acre variables.
knitr::kable(iqrTab, row.names = F, digits = 0, format = 'markdown')
| var | Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. | NA.s |
|---|---|---|---|---|---|---|---|
| oaf | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| oaf.iqr | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| age | 5 | 35 | 45 | 101 | 56 | 63820 | 6 |
| age.iqr | 5 | 35 | 45 | 46 | 56 | 87 | 6 |
| respondent.gender | 0 | 0 | 1 | 1 | 1 | 1 | 2 |
| respondent.gender.iqr | 0 | 0 | 1 | 1 | 1 | 1 | 2 |
| hhsize | 0 | 4 | 6 | 6 | 7 | 66 | 2 |
| hhsize.iqr | 0 | 4 | 6 | 6 | 7 | 11 | 2 |
| seasons.oaf | 0 | 0 | 1 | 2 | 3 | 9 | 2 |
| seasons.oaf.iqr | 0 | 0 | 1 | 2 | 3 | 7 | 2 |
| cows | 0 | 0 | 2 | 2 | 3 | 20 | 1 |
| cows.iqr | 0 | 0 | 2 | 2 | 3 | 7 | 1 |
| goats | 0 | 0 | 0 | 1 | 0 | 14 | 1 |
| goats.iqr | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| chickens | 0 | 3 | 6 | 9 | 11 | 300 | 1 |
| chickens.iqr | 0 | 3 | 6 | 7 | 10 | 23 | 1 |
| pigs | 0 | 0 | 0 | 0 | 0 | 30 | 8 |
| pigs.iqr | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| sheep | 0 | 0 | 0 | 0 | 0 | 25 | 1 |
| sheep.iqr | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| plot.size | 0 | 0 | 0 | 0 | 0 | 2 | 2 |
| plot.size.iqr | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
| seed.kgs | 0 | 3 | 4 | 5 | 6 | 100 | 56 |
| seed.kgs.iqr | 0 | 3 | 4 | 4 | 6 | 10 | 56 |
| yield | 0 | 2 | 5 | 17 | 9 | 1120 | 56 |
| yield.iqr | 0 | 2 | 4 | 5 | 7 | 19 | 56 |
| harvestcomp.2016 | 1 | 1 | 2 | 2 | 2 | 3 | 48 |
| harvestcomp.2016.iqr | 1 | 1 | 2 | 2 | 2 | 3 | 48 |
| intercrop.seed.kgs | 0 | 2 | 4 | 5 | 6 | 60 | 810 |
| intercrop.seed.kgs.iqr | 0 | 2 | 4 | 5 | 6 | 12 | 810 |
| intercrop.yield | 0 | 3 | 8 | 12 | 20 | 104 | 812 |
| intercrop.yield.iqr | 0 | 2 | 8 | 11 | 20 | 45 | 812 |
| intercrop.harvestcomp | 1 | 1 | 2 | 2 | 2 | 3 | 805 |
| intercrop.harvestcomp.iqr | 1 | 1 | 2 | 2 | 2 | 3 | 805 |
| dap.main | 0 | 10 | 25 | 23 | 25 | 200 | 240 |
| dap.main.iqr | 0 | 10 | 16 | 18 | 25 | 46 | 240 |
| can.main | 0 | 12 | 25 | 24 | 25 | 200 | 677 |
| can.main.iqr | 0 | 10 | 25 | 18 | 25 | 42 | 677 |
| npk.main | 3 | 20 | 50 | 37 | 50 | 100 | 1924 |
| npk.main.iqr | 3 | 16 | 38 | 32 | 50 | 50 | 1924 |
| urea.main | 1 | 4 | 6 | 12 | 10 | 100 | 1878 |
| urea.main.iqr | 1 | 4 | 5 | 5 | 8 | 12 | 1878 |
| dap.intercrop | 0 | 5 | 10 | 13 | 12 | 125 | 1658 |
| dap.intercrop.iqr | 0 | 4 | 8 | 7 | 10 | 20 | 1658 |
| can.intercrop | 0 | 5 | 8 | 12 | 12 | 125 | 1856 |
| can.intercrop.iqr | 0 | 4 | 6 | 7 | 10 | 20 | 1856 |
| npk.intercrop | 25 | 31 | 38 | 38 | 44 | 50 | 1935 |
| npk.intercrop.iqr | 25 | 31 | 38 | 38 | 44 | 50 | 1935 |
| urea.intercrop | 0 | 2 | 3 | 3 | 4 | 5 | 1933 |
| urea.intercrop.iqr | 0 | 2 | 3 | 3 | 4 | 5 | 1933 |
| compost.kgs | 0 | 0 | 0 | 77 | 80 | 12000 | 54 |
| compost.kgs.iqr | 0 | 0 | 0 | 33 | 50 | 200 | 54 |
| lime.kgs | 4 | 10 | 12 | 29 | 25 | 150 | 1901 |
| lime.kgs.iqr | 4 | 9 | 12 | 15 | 23 | 38 | 1901 |
| slope | 0 | 5 | 5 | 7 | 10 | 45 | 2 |
| slope.iqr | 0 | 5 | 5 | 6 | 10 | 17 | 2 |
| plot.m2 | 253 | 1012 | 2023 | 2113 | 2023 | 8092 | 2 |
| plot.m2.iqr | 253 | 1012 | 2023 | 1578 | 2023 | 3034 | 2 |
| can.acre | -132 | 32 | 50 | 47 | 50 | 200 | 677 |
| can.acre.iqr | 5 | 25 | 50 | 41 | 50 | 77 | 677 |
| dap.acre | -132 | 32 | 50 | 47 | 50 | 400 | 240 |
| dap.acre.iqr | 5 | 25 | 50 | 40 | 50 | 77 | 240 |
| npk.acre | 6 | 33 | 50 | 62 | 100 | 200 | 1924 |
| npk.acre.iqr | 6 | 33 | 50 | 62 | 100 | 200 | 1924 |
| urea.acre | 1 | 8 | 16 | 21 | 24 | 100 | 1878 |
| urea.acre.iqr | 1 | 8 | 16 | 15 | 20 | 40 | 1878 |
| compost.acre | -1584 | 0 | 0 | 187 | 180 | 48000 | 37 |
| compost.acre.iqr | -198 | 0 | 0 | 61 | 80 | 450 | 37 |
| seed.acre | -79 | 8 | 10 | 12 | 12 | 640 | 55 |
| seed.acre.iqr | 3 | 8 | 9 | 10 | 12 | 18 | 55 |
| intercrop.seed.acre | -79 | 7 | 9 | 12 | 16 | 120 | 809 |
| intercrop.seed.acre.iqr | 0 | 6 | 8 | 10 | 14 | 30 | 809 |
| intercrop.dap.acre | -198 | 10 | 20 | 25 | 40 | 100 | 1657 |
| intercrop.dap.acre.iqr | 0 | 10 | 20 | 24 | 40 | 80 | 1657 |
| intercrop.can.acre | -132 | 8 | 20 | 24 | 40 | 100 | 1855 |
| intercrop.can.acre.iqr | 0 | 9 | 20 | 24 | 37 | 64 | 1855 |
| intercrop.npk.acre | 25 | 31 | 38 | 38 | 44 | 50 | 1935 |
| intercrop.npk.acre.iqr | 25 | 31 | 38 | 38 | 44 | 50 | 1935 |
| intercrop.urea.acre | 0 | 6 | 9 | 8 | 12 | 16 | 1933 |
| intercrop.urea.acre.iqr | 0 | 6 | 9 | 8 | 12 | 16 | 1933 |
| yield.kg | 4 | 135 | 300 | 489 | 540 | 40000 | 80 |
| yield.kg.iqr | 4 | 135 | 270 | 340 | 495 | 1120 | 80 |
| pH | 0 | 5 | 6 | 6 | 6 | 8 | 34 |
| pH.iqr | 4 | 5 | 6 | 6 | 6 | 6 | 34 |
| X.EC..Salts. | 31 | 64 | 74 | Inf | 84 | Inf | 34 |
| X.EC..Salts..iqr | 34 | 64 | 74 | 74 | 84 | 115 | 34 |
| Phosphorus | 2 | 7 | 10 | Inf | 15 | Inf | 34 |
| Phosphorus.iqr | 2 | 7 | 10 | 11 | 14 | 28 | 34 |
| Potassium | 0 | 104 | 142 | 162 | 204 | 1104 | 34 |
| Potassium.iqr | 0 | 103 | 140 | 155 | 197 | 353 | 34 |
| Calcium | 0 | 581 | 901 | 1039 | 1292 | 8131 | 34 |
| Calcium.iqr | 0 | 568 | 872 | 936 | 1223 | 2337 | 34 |
| Magnesium | 0 | 102 | 151 | 172 | 216 | 964 | 34 |
| Magnesium.iqr | 0 | 100 | 148 | 162 | 209 | 387 | 34 |
| Sulphur | 4 | 9 | 11 | Inf | 12 | Inf | 34 |
| Sulphur.iqr | 5 | 9 | 11 | 11 | 12 | 16 | 34 |
| Copper | 0 | 2 | 3 | 4 | 5 | 16 | 34 |
| Copper.iqr | 0 | 2 | 3 | 4 | 5 | 9 | 34 |
| Boron | 0 | 0 | 0 | 0 | 0 | 1 | 34 |
| Boron.iqr | 0 | 0 | 0 | 0 | 0 | 1 | 34 |
| Zinc | 0 | 5 | 6 | 6 | 8 | 18 | 34 |
| Zinc.iqr | 2 | 5 | 6 | 6 | 8 | 11 | 34 |
| X.Sodium | 0 | 4 | 7 | Inf | 12 | Inf | 34 |
| X.Sodium.iqr | 0 | 4 | 6 | 8 | 11 | 24 | 34 |
| Iron | 94 | 132 | 144 | Inf | 158 | Inf | 34 |
| Iron.iqr | 94 | 132 | 144 | 145 | 156 | 196 | 34 |
| X.C.E.C | 0 | 7 | 10 | 11 | 14 | 49 | 34 |
| X.C.E.C.iqr | 0 | 7 | 10 | 11 | 14 | 23 | 34 |
| X.Exchangeable.Aluminium | 0 | 0 | 0 | Inf | 0 | Inf | 34 |
| X.Exchangeable.Aluminium.iqr | 0 | 0 | 0 | 0 | 0 | 0 | 34 |
| X.Exchangeable.Acidity | 0 | 0 | 0 | Inf | 0 | Inf | 34 |
| X.Exchangeable.Acidity.iqr | 0 | 0 | 0 | 0 | 0 | 0 | 34 |
| X.Phosphorus.Sorption.Index..PSI. | 0 | 83 | 118 | 127 | 161 | 411 | 34 |
| X.Phosphorus.Sorption.Index..PSI..iqr | 0 | 82 | 116 | 123 | 158 | 274 | 34 |
| X.Organic.Carbon | 0 | 2 | 2 | 2 | 2 | 3 | 34 |
| X.Organic.Carbon.iqr | 1 | 2 | 2 | 2 | 2 | 3 | 34 |
| Manganese | 66 | 172 | 267 | 260 | 338 | 498 | 34 |
| Manganese.iqr | 66 | 172 | 267 | 260 | 338 | 498 | 34 |
| Aluminium | 522 | 941 | 1073 | 1046 | 1164 | 1508 | 34 |
| Aluminium.iqr | 609 | 946 | 1076 | 1052 | 1165 | 1495 | 34 |
| X.Total.Nitrogen | 0 | 0 | 0 | 0 | 0 | 0 | 34 |
| X.Total.Nitrogen.iqr | 0 | 0 | 0 | 0 | 0 | 0 | 34 |
# http://rforpublichealth.blogspot.com/2014/02/ggplot2-cheatsheet-for-visualizing.html
for(i in 1:length(numVarsNotAdmin)){
base <- ggplot(d, aes(x=d[,numVarsNotAdmin[i]])) + labs(x = numVarsNotAdmin[i])
temp1 <- base + geom_density()
temp2 <- base + geom_histogram()
#temp2 <- boxplot(r[,numVars[i]],main=paste0("Variable: ", numVars[i]))
multiplot(temp1, temp2, cols = 2)
}
Before aligning them with the baseline soil values
check.3sd <- function(x) {
x = ifelse(is.infinite(x), NA, x)
mean = mean(x, na.rm=T)
sd = sd(x, na.rm=T)
mark = ifelse(x>(mean + (3*sd)) |
x<(mean - (3*sd)), NA, x)
return(mark)
}
sdSoilVals <- d %>%
dplyr::select(pH:X.Total.Nitrogen)
sdCheck <- as.data.frame(apply(sdSoilVals, 2, function(x){
return(check.3sd(x))
}))
for(i in 1:length(soilVars)){
print(ggplot(data=sdCheck, aes(x=sdCheck[,soilVars[i]])) +
geom_density() +
labs(x=soilVars[i])
)
}
Important note: I’m going to add the adjusted values to the r data frame giving the previous variables the extension .raw so I can distinguish between the original and modified data.
names(d)[which(names(d)=="pH"):which(names(d)=="X.Total.Nitrogen")] <- paste0(names(d)[which(names(d)=="pH"):which(names(d)=="X.Total.Nitrogen")], ".raw")
d <- cbind(d, sdCheck)
Check for unique ids in the round 1 data
length(d$Soil_Sample_Id)==length(unique(d$Soil_Sample_Id))
[1] FALSE
dups <- d$Soil_Sample_Id[duplicated(d$Soil_Sample_Id) & !is.na(d$Soil_Sample_Id)]
Let’s use the baseline data to resolve these duplicated ids
roundId <- d %>%
dplyr::select(district, site, Soil_Sample_Id) %>%
filter(d$Soil_Sample_Id %in% dups) %>%
filter(!is.na(district) & !is.na(Soil_Sample_Id))
The first one is marked as both OAF and not OAF in the follow up. The surveys happened on consecutive days. I’m going to keep the obersvation with the same phone number from the baseline.
# duplicated id 1
idCheck <- d[d$Soil_Sample_Id %in% dups[1],]
bCheck <- b[b$Soil_Sample_Id %in% dups[1],]
d <- d[-which(d$Soil_Sample_Id==dups[1] & d$phone!=724903366),]
I think the second survey more closely resembles the baseline version of this farmer in the data. I’m keeping the survey done by Lazarus.
#duplicated id 2
idCheck <- d[d$Soil_Sample_Id %in% dups[2],]
bCheck <- b[b$Soil_Sample_Id %in% dups[2],]
d <- d[-which(d$Soil_Sample_Id==dups[2] & d$metadata.username!="lazarus.m"),]
The age and the GPS coordinate indicate that the survey with the age of 54 is probably the right one.
#duplicated id 3
idCheck <- d[d$Soil_Sample_Id %in% dups[3],]
bCheck <- b[b$Soil_Sample_Id %in% dups[3],]
d <- d[-which(d$Soil_Sample_Id==dups[3] & d$age!=54),]
Quick check of the work. There are still Soil_Sample_Id that are missing. Most of them have a new soil sample id but not all.
TODO - check with Charles as to why someone might not have a Soil_Sample_Id from the baseline but would have a new Soil_Sample_Id in round 1.
There are not that many new soil sample ids so perhaps it was a way to add more farmers to the sample?
length(d$Soil_Sample_Id)==length(unique(d$Soil_Sample_Id))
[1] FALSE
table(is.na(d$Soil_Sample_Id))
FALSE TRUE
1872 62
#d[is.na(d$Soil_Sample_Id),]
FOR THE TIME BEING: drop those missing a soil sample id in round 1.
d <- d[-which(is.na(d$Soil_Sample_Id)),]
I’ll do a couple things:
rbindlength(b$Soil_Sample_Id)==length(unique(b$Soil_Sample_Id))
[1] FALSE
dups <- b$Soil_Sample_Id[duplicated(b$Soil_Sample_Id) & !is.na(b$Soil_Sample_Id)]
Resolve them here. They all appear to be control values that were used twice. Can I assign them a new control value? Probably not. See who they followed up with the first round to know which one to keep.
#b[b$soil_sample_id=="c330",] # not clear
#b[b$soil_sample_id=="c384",] # it's exactly the same, drop one
#b[b$soil_sample_id=="c1129",] # it's exactly the same, drop one
#b[b$soil_sample_id=="c626",] # not clear
#b[b$soil_sample_id=="c856",] # it's exactly the same, drop one
dropOne <- c("c384", "c1129", "c856")
b$n <- ave(1:length(b$Soil_Sample_Id), b$Soil_Sample_Id, FUN = seq_along)
b$drop <- ifelse(b$n==2 & b$Soil_Sample_Id %in% dropOne, 1, 0)
b <- b[-which(b$drop==1),]
dropForNow <- c("c330", "c626") # see below
TODO - confirm the data cleaning process at the baseline so that the soil data variables are being treated the same way in the baseline and round 1. They should probably be cleaned separately but should be treated the same way.
d$season <- "2016"
b$season <- "2015"
names(d) <- tolower(names(d))
names(b) <- tolower(names(b))
bUpdate <- b %>% rename(
region = regionname,
district = districtname,
site = sitename,
respondent.gender = gender, # check that this is correct
plot.location = field_location,
plot.size = field_size,
harvestcomp.2016 = yield_comparative,
main.crop = plot_information.maincrop,
main.crop.specify = plot_information.specify_other_main_crop,
seed.type = seedtype,
seed.kgs = seedkgs,
intercrop = intercrop.intercrop,
intercrop.specify = intercrop.specify_other_main_crop,
intercrop.seed.type = intercrop.seedtype,
intercrop.seed.kgs = intercrop.seedkgs,
intercrop.harvestcomp = intercrop.yield_comparative,
fertilizer.main = inputs.input_maincrop,
dap.main = inputs.dap_kg,
can.main = inputs.can_kg,
npk.main = inputs.npk_kg,
urea.main = inputs.urea_kg,
fertilizer.intercrop = inputs.input_intercrop,
dap.intercrop = inputs.dap_kg_intercrop,
can.intercrop = inputs.can_kg_intercrop,
npk.intercrop = inputs.npk_kg_intercrop,
urea.intercrop = inputs.urea_kg_intercrop,
compost.kgs = inputs.compost,
compost.quality = inputs.compost_quality,
compost.type = inputs.compost_type,
lime = inputs.lime,
lime.kgs = inputs.lime_kg,
soil.type = soil_type,
soil.color = soil_color,
erosion = anti_erosion,
field.location = location,
seasons.oaf = seasons_oaf,
x.c.e.c = c.e.c,
copper = cu,
x.ec..salts. = ec,
#x.exchangeable.aluminum = exch.al,
x.phosphorus.sorption.index..psi. = psi,
potassium = k,
magnesium = mg,
manganese = mn,
boron = b,
calcium = ca,
iron = fe,
x.sodium = na,
phosphorus = p,
sulphur = s,
zinc = zn,
x.organic.carbon = total.c,
x.total.nitrogen = total.n,
x.exchangeable.aluminium = exch.al
)
# helper to know what else needs to be changed
names(d)[!names(d) %in% names(bUpdate)]
[1] "appformid" "id" "domain"
[4] "date_header" "metadata.deviceid" "metadata.userid"
[7] "metadata.username" "metadata.timestart" "metadata.timeend"
[10] "form.case.@case_id" "same.hh" "agree"
[13] "interviewed.2016" "refusal" "phone"
[16] "district1" "site1" "participation"
[19] "hhsize" "which.seasons" "own"
[22] "relation.land" "own.2016" "most.farming"
[25] "yield.unit" "intercrop.yield.unit" "fertility.comp"
[28] "fallow.yn" "fallow.cover" "residues"
[31] "slope" "sampling.barcode" "comment"
[34] "#form/sampling2017_complete" "new_soil_sample_id" "plot.m2"
[37] "seed.acre" "intercrop.seed.acre" "intercrop.dap.acre"
[40] "intercrop.can.acre" "intercrop.npk.acre" "intercrop.urea.acre"
[43] "yield.kg" "ph.raw" "x.ec..salts..raw"
[46] "phosphorus.raw" "potassium.raw" "calcium.raw"
[49] "magnesium.raw" "sulphur.raw" "copper.raw"
[52] "boron.raw" "zinc.raw" "x.sodium.raw"
[55] "iron.raw" "x.c.e.c.raw" "x.exchangeable.aluminium.raw"
[58] "x.exchangeable.acidity.raw" "x.phosphorus.sorption.index..psi..raw" "x.organic.carbon.raw"
[61] "manganese.raw" "aluminium.raw" "x.total.nitrogen.raw"
[64] "x.exchangeable.acidity" "aluminium"
names(bUpdate)[!names(bUpdate) %in% names(d)]
[1] "ssn" "info.formid" "last_name" "first_name"
[5] "clientphone" "seasons_oaf_specify" "chemical_fertilizer_5" "why_no_chemical_fertilizer"
[9] "compost_fertilizer_5" "why_no_compost" "lime_fertilizer_5" "why_no_lime"
[13] "fertilizer_5" "legum_maincrop_5" "legum_intercrop_5" "crop_rese"
[17] "fertile_comparison" "globalclientid" "soil_sampling_complete" "info.caseid"
[21] "info.completed_time" "info.username" "info.started_time" "date.sampled"
[25] "oaf_id" "hp" "lat" "lon"
[29] "alt" "precision" "valley" "hilltop"
[33] "hillside" "crop.maize" "dap.check" "field_size.adj"
[37] "seasons_breaks" "own.cows" "own.goats" "own.chickens"
[41] "own.pigs" "own.sheep" "clay_soil" "loam_soil"
[45] "sandy_soil" "n" "drop"
The variables that remain to be aligned are the soil variables and any resulting calulations such as fertilizer per acre. These caculations can be redone once the data is combined.
Make groups of variables to make them easier to find later
historicalBaseline <- names(b)[which(names(b)=="chemical_fertilizer_5"):which(names(b)=="legum_intercrop_5")]
baselineSoilVars <- names(b)[which(names(b)=="c.e.c"):which(names(b)=="total.n")]
d <- d %>%
mutate(
soil_sample_id = tolower(gsub(" ", "", soil_sample_id))
)
commonVars <- names(d)[names(d) %in% names(bUpdate)]
write.csv(commonVars, file="varNamesforM&E.csv")
m2ToAcres <- function(input, meters) {
res <- (input/meters)*4046
return(res)
}
# check soil sample id matches before rbind
table(d$soil_sample_id %in% bUpdate$soil_sample_id)
FALSE TRUE
32 1840
d$soil_sample_id[!d$soil_sample_id %in% bUpdate$soil_sample_id]
[1] "lugari52284" "busia33787" "c1136" "gem45918" "c703"
[6] "busia34175" "butere49984" "migori14465" "kakamegab(north)61463" "rongo33010"
[11] "webuye53326" "rachuonyo22477" "c863" "rachuonyo16330" "rongo35982"
[16] "c241" "c290" "migori17557" "gem64358" "matete54668"
[21] "kisii43004" "kakamegab(north)15878" "webuye52641" "gem57424" "suneka16515"
[26] "kakamegab(north)63510" "c857" "rachuonyo42533" "rachuonyo39005" "suneka30533"
[31] "rachuonyo40077" "kakamegab(north)6413"
table(bUpdate$soil_sample_id %in% d$soil_sample_id)
FALSE TRUE
190 1842
bUpdate$soil_sample_id[!bUpdate$soil_sample_id %in% d$soil_sample_id]
[1] "busia27558" "busia25200" "c27" "c198" "c207"
[6] "c208" "c209" "c215" "c240" "matete15959"
[11] "matete57201" "c795" "c797" "c802" "c807"
[16] "c835" "c841" "kakamegab(north)45636" "kakamegab(north)54395" "kakamegab(north)55666"
[21] "kakamegab(north)57759" "c526" "c555" "c556" "c557"
[26] "c558" "c559" "c560" "c561" "c562"
[31] "c563" "c565" "c566" "c567" "c568"
[36] "c569" "c570" "c571" "c572" "c573"
[41] "c574" "c575" "c576" "c577" "c578"
[46] "c579" "c581" "c582" "c583" "c584"
[51] "c585" "c586" "c587" "c591" "c597"
[56] "c604" "lugari65731" "lugari66391" "c695" "c704"
[61] "c707" "c721" "c731" "c736" "c744"
[66] "c770" "c775" "c780" "c784" "butere15379"
[71] "butere50525" "butere56919" "butere57046" "butere55190" "butere51928"
[76] "butere56266" "c127" "c77" "c121" "c162"
[81] "c63" "c152" "c153" "c154" "c61"
[86] "c124" "c114" "c93" "c135" "c69"
[91] "c57" "c84" "c155" "c71" "c163"
[96] "gem60104" "gem51405" "gem57253" "c301" "c270"
[101] "c357" "c249" "c396" "c266" "c373"
[106] "c282" "c331" "c281" "c397" "c398"
[111] "c272" "c393" "c259" "c375" "webuye33853"
[116] "c1220" "kakamega(south)52107" "kakamega(south)55651" "kakamega(south)61334" "c490"
[121] "c499" "c475" "c459" "c470" "c410"
[126] "c488" "rachuonyo30231" "rachuonyo41225" "c1014" "c982"
[131] "c1054" "c980" "c1021" "c1055" "suneka32800"
[136] "suneka33096" "suneka30964" "suneka29770" "suneka31042" "c1159"
[141] "c1161" "c1187" "rongo27024" "rongo29226" "rongo19791"
[146] "rongo18810" "rongo33013" "c1133" "c1117" "c1137"
[151] "c1140" "c1106" "c1138" "c1134" "c1125"
[156] "c1118" "c1126" "c1149" "c1147" "c1095"
[161] "c1135" "c1139" "c1143" "kisii41906" "kisii41862"
[166] "c643" "c682" "c665" "migori5694" "migori18384"
[171] "migori9269" "migori18010" "migori10391" "c882" "c898"
[176] "c899" "c910" "c928" "c884" "c960"
[181] "c935" "c922" "c914" "c936" "c867"
[186] "c880" "c870" "c883" "c858" "c1240"
fieldDat <- rbind(bUpdate[,commonVars], d[,commonVars])
fieldDat <- fieldDat %>% mutate(
plot.m2 = plot.size*4046,
can.acre = m2ToAcres(can.main, plot.m2),
dap.acre = m2ToAcres(dap.main,plot.m2),
npk.acre = m2ToAcres(npk.main,plot.m2),
urea.acre = m2ToAcres(urea.main,plot.m2),
compost.acre = m2ToAcres(compost.kgs,plot.m2),
seed.acre = m2ToAcres(seed.kgs,plot.m2),
intercrop.seed.acre = m2ToAcres(intercrop.seed.kgs, plot.m2),
intercrop.dap.acre = m2ToAcres(dap.intercrop, plot.m2),
intercrop.can.acre = m2ToAcres(can.intercrop, plot.m2),
intercrop.npk.acre = m2ToAcres(npk.intercrop, plot.m2),
intercrop.urea.acre = m2ToAcres(urea.intercrop, plot.m2)
) %>% rename(
d_client = oaf,
sample_id = soil_sample_id
)
# drop the ids that are duplicates
fieldDat <- fieldDat[!fieldDat$sample_id %in% dropForNow,]
Check that categorical variable responses align between the baseline and round 1.
catVars <- names(fieldDat)[sapply(fieldDat, function(x){
is.character(x)
})]
fieldNonAdmin <- catVars[!catVars %in% c("plot.location", "gps", "sample_id", "main.crop.specify", "intercrop.specify", "photo", "site")]
fieldDat$region <- ifelse(grepl("nyan", tolower(fieldDat$region)), "Nyanza", "Western")
for(i in 1:length(fieldNonAdmin)){
repGraphs(fieldDat, fieldNonAdmin[i])
}
TODO
The aim of the analysis is to:
TODO
fieldDat %>%
dplyr::select(sample_id, season, d_client) %>%
group_by(sample_id) %>%
spread(., season, d_client) %>%
rename(
client15 = `2015`,
client16 = `2016`
) %>%
mutate(
becameClient = ifelse(client15==0 & client16==1, 1, 0),
becameControl = ifelse(client15==1 & client16==0, 1, 0),
stayedClient = ifelse(client15==1 & client16==1, 1, 0),
stayedControl = ifelse(client15==0 & client16==0, 1, 0)
) %>%
ungroup() %>%
dplyr::summarize_each(
funs(mean= mean(., na.rm=T)), -c(sample_id, client15, client16)
) %>%
mutate_each(
funs(paste0(round(.,2)*100, "%"))
) %>%
kable(caption="Movement in Sample", format='markdown')
| becameClient_mean | becameControl_mean | stayedClient_mean | stayedControl_mean |
|---|---|---|---|
| 4% | 11% | 38% | 42% |
See sketch of SHS report.
TODO - START HERE
Consider including:
Helpful link for executing code in parallel
if(!file.exists(regFile) || forceUpdate) {
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores, type="FORK")
clusterEvalQ(cl, "plm")
clusterExport(cl, "fieldSoilDat")
clusterExport(cl, "keySoilVars")
clusterExport(cl, "indFeList")
indFeLoop <- parLapply(cl, indFeList, function(mod){
lapply(keySoilVars, function(outcome){
form = lm(reformulate(termlabels = mod, response = outcome), data=fieldSoilDat)
pdf(file=paste("output/", paste0(outcome, paste(mod, collapse = "")), ".pdf", sep = ""))
print(plot(form))
dev.off()
form = plm(form, c("sample_id", "age", "age2"))
rownames(form) = paste(rownames(form), outcome, sep = " ")
return(form)
})
})
stopCluster(cl)
save(indFeLoop, file=regFile)
} else {
load("regFile.RData")
}
not plotting observations with leverage one:
138, 146, 159, 165, 215, 222, 227, 256, 292, 299, 308, 337, 364, 365, 366, 367, 368, 371, 376, 377, 379, 381, 384, 386, 387, 388, 391, 395, 398, 504, 520, 528, 530, 539, 554, 561, 580, 586, 596, 659, 674, 728, 752, 754, 781, 792, 811, 815, 820, 824, 873, 974, 994, 1013, 1015, 1022, 1027, 1029, 1043, 1050, 1077, 1089, 1092, 1107, 1121, 1144, 1152, 1207, 1262, 1279, 1350, 1353, 1355, 1398, 1405, 1445, 1452, 1483, 1526, 1538, 1541, 1560, 1565, 1610, 1633, 1638, 1669, 1671, 1684, 1699, 1722, 1724, 1725, 1734, 1735, 1736, 1743, 1755, 1758, 1791, 1795, 1797, 1808, 1844, 1847, 1862, 1878, 1883, 1895, 1935, 1937, 1949, 1954, 1966, 1969, 1978, 1989, 1991, 1996, 2019, 2544, 2600, 2609, 2767, 2792, 2830, 3038, 3583, 3618, 3682, 3716, 3768, 3807
And combine model outputs into tables for each model
In the individual fixed effect model above, the naive model would only include a client indicator and individual fixed effects. If we add season, we lose significance on almost everything. I’d guess that as we add more likely controls we additionally lose signficance. I’ve included age and age squared along the lines of Hicks et.al.
| Coefficient | 95% Confidence Interval | P-Value | |
|---|---|---|---|
| (Intercept) ph | 5.5000 | 5 to 6 | <0.001 *** |
| as.factor(d_client)1 ph | -0.0330 | -0.087 to 0.021 | 0.24 |
| as.factor(season)2016 ph | -0.3900 | -0.41 to -0.36 | <0.001 *** |
| (Intercept) calcium | 440.0000 | -160 to 1000 | 0.15 |
| as.factor(d_client)1 calcium | -34.0000 | -100 to 36 | 0.34 |
| as.factor(season)2016 calcium | -56.0000 | -84 to -28 | <0.001 *** |
| (Intercept) magnesium | 74.0000 | -0.84 to 150 | 0.053 . |
| as.factor(d_client)1 magnesium | -5.7000 | -14 to 3 | 0.2 |
| as.factor(season)2016 magnesium | -14.0000 | -17 to -10 | <0.001 *** |
| (Intercept) x.organic.carbon | 1.2000 | 0.77 to 1.6 | <0.001 *** |
| as.factor(d_client)1 x.organic.carbon | 0.0170 | -0.033 to 0.067 | 0.5 |
| as.factor(season)2016 x.organic.carbon | -0.3100 | -0.33 to -0.29 | <0.001 *** |
| (Intercept) x.total.nitrogen | 0.1000 | 0.074 to 0.13 | <0.001 *** |
| as.factor(d_client)1 x.total.nitrogen | -0.0032 | -0.0063 to -1.5e-05 | 0.049 * |
| as.factor(season)2016 x.total.nitrogen | -0.0110 | -0.012 to -0.0099 | <0.001 *** |